CellProfiler is designed to analyze images in a high-throughput manner.
Once a pipeline has been established for a set of images, CellProfiler
can export batches of images to be analyzed on a computing cluster with the
pipeline.
It is possible to process tens or even hundreds of thousands of
images for one analysis in this manner. We do this by breaking the entire
set of images into separate batches, then submitting each of these batches
as individual jobs to a cluster. Each batch can be analyzed independently
of the rest.
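The arithmetic behind the split can be sketched in a few lines of Python. This is only an illustration: the function name `batch_ranges` and the batch size of 50 are assumptions chosen for the example, not part of CellProfiler. It returns the first and last image set number of each batch, numbered from 1.

```python
def batch_ranges(total_image_sets, batch_size):
    """Return 1-based, inclusive (first, last) image set numbers per batch.

    Hypothetical helper for illustration; CellProfiler does not provide it.
    """
    if total_image_sets < 1 or batch_size < 1:
        raise ValueError("total_image_sets and batch_size must be positive")
    ranges = []
    for first in range(1, total_image_sets + 1, batch_size):
        # The final batch may be shorter than batch_size.
        last = min(first + batch_size - 1, total_image_sets)
        ranges.append((first, last))
    return ranges


print(batch_ranges(120, 50))  # [(1, 50), (51, 100), (101, 120)]
```

Smaller batches spread the work over more cluster jobs, while larger batches reduce the per-job startup overhead; the right size depends on your cluster.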
Submitting files for batch processing
Below is a basic workflow for submitting your image batches to the cluster.
- Create a folder for your project on your cluster. For high-throughput
analysis, it is recommended to create a separate project folder for each run.
- Within this project folder, create the following folders (both of which must
be connected to the cluster computing network):
  - Create an input folder, then transfer all of your images to this folder.
  The input folder must be readable by everyone (or at least your
  cluster) because each of the separate cluster computers will read input files from
  this folder.
  - Create an output folder where all your output data will be stored. The
  output folder must be writable by everyone (or at least your cluster) because
  each of the separate cluster computers will write output files to this folder.
If you cannot create folders and set read/write permissions on these folders (or don't know
how), ask your Information Technology (IT) department for help.
- Press the "View output settings" button. In the panel that appears,
set the Default Input and Default Output Folders
to the input and output folders created above, respectively. The Default Input
Folder setting will only appear if a legacy pipeline is being run.
- Create a pipeline for your image set. You should test it on a few example
images from your image set (if you are unfamiliar with the concept of an image set, please
see the help for the Input modules). The module settings selected for your pipeline will be
applied to all your images, but the results may vary
depending on the image quality, so it is critical to ensure that your settings are
robust against your "worst-case" images.
For instance, some images may contain no cells. If this happens, the automatic thresholding
algorithms will incorrectly choose a very low threshold, and therefore "find"
spurious objects. This can be overcome by setting a lower limit on the threshold in
the IdentifyPrimaryObjects module.
The Test mode in CellProfiler may be used for previewing the results of your settings
on images of your choice. Please refer to Help > Testing Your Pipeline
for more details on how to use this utility.
- Add the CreateBatchFiles module to the end of your pipeline.
This module is needed to resolve the pathnames to your files with respect to
your local machine and the cluster computers. If you are processing large batches
of images, you may also consider adding ExportToDatabase to your pipeline,
after your measurement modules but before the CreateBatchFiles module. This module
will export your data either directly to a MySQL/SQLite database or into a set of
comma-separated files (CSV) along with a script to import your data into a
MySQL database. Please refer to the help for these modules in order to learn more
about which settings are appropriate.
- Run the pipeline to create a batch file. Click the Analyze images
button and the analysis will begin processing locally. Do not be surprised if this initial step
takes a while since CellProfiler must first create the entire image set list based
on your settings in the Input modules (this process can be sped
up by creating your list of images as a CSV and using the LoadData module to load it).
With the CreateBatchFiles module in place, the pipeline will not process all
the images, but instead will create a batch file (a file called
Batch_data.h5) and save it in the Default Output Folder (set in the output settings above). The advantage of
using CreateBatchFiles from the researcher's perspective is that the Batch_data.h5
file generated by the module captures all of the data needed to run the analysis. You
are now ready to submit this batch file to the cluster to run each of the batches
of images on different computers on the cluster.
- Submit your batches to the cluster. Log on to your cluster, and navigate
to the directory where you have installed CellProfiler on the cluster.
A single batch can be submitted with the following command:
./python-2.6.sh CellProfiler.py -p <Default_Output_Folder_path>/Batch_data.h5 -c -r -b -f <first_image_set_number> -l <last_image_set_number>
This command submits the batch file to CellProfiler and specifies that CellProfiler run in a
batch mode without its user interface to process the pipeline.
This run can be modified by using additional options to CellProfiler that
specify the following (type "CellProfiler.py -h" to see a list of available options):
-p <Default_Output_Folder_path>/Batch_data.h5
: The location of the batch file, where <Default_Output_Folder_path> is the
output folder path as seen by the cluster computer.
-c
: Run "headless", i.e., without the GUI.
-r
: Run the pipeline specified on startup, which is contained in
the batch file.
-b
: Do not build extensions, since by this point, they should
already be built.
-f <first_image_set_number>
: Start processing with the image set specified, <first_image_set_number>.
-l <last_image_set_number>
: Finish processing with the image set specified, <last_image_set_number>.
Typically, a user will break a long image set list into pieces and execute each of
these pieces using the command line switches -f and -l to
specify the first and last image sets in each job. Processing a full image set list then requires
a script that calls CellProfiler with these options and sequential image set numbers
(e.g., 1-50, 51-100, etc.) to submit each piece as an individual job.
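As a rough sketch of such a script, the Python below builds one command line per piece from the switches described above. The helper name `batch_command_lines`, the example path, and the ranges are assumptions for illustration. How each line is handed to your scheduler is site-specific and is not shown.

```python
import shlex

# Launcher taken from the command shown above; replace it with the
# CellProfiler executable if you use the compiled version.
LAUNCHER = ("./python-2.6.sh", "CellProfiler.py")


def batch_command_lines(batch_file, ranges, launcher=LAUNCHER):
    """Return one shell-quoted command line per (first, last) range.

    Hypothetical helper for illustration; not part of CellProfiler.
    """
    lines = []
    for first, last in ranges:
        argv = [*launcher, "-p", batch_file, "-c", "-r", "-b",
                "-f", str(first), "-l", str(last)]
        # shlex.join quotes arguments that contain spaces or shell characters.
        lines.append(shlex.join(argv))
    return lines


# The path below is a made-up example of an output folder as seen by the cluster.
for line in batch_command_lines("/cluster/project/output/Batch_data.h5",
                                [(1, 50), (51, 100)]):
    print(line)
```

Building each command as a list of arguments and quoting it only at the end keeps paths that contain spaces intact.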
If you need help in producing the batch commands for submitting your jobs, use the
--get-batch-commands switch along with the -p switch to specify the
Batch_data.h5 file output by the CreateBatchFiles module. When specified, CellProfiler
will output one line to the terminal per job to be run. This output should be further
processed to generate a script that can invoke the jobs in a cluster-computing context.
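One possible way to do that post-processing is sketched below, assuming you have captured the terminal output as text. The function name `wrap_for_scheduler` and the default template `echo {command} | qsub` are assumptions: `qsub` stands in for whatever submission command your cluster uses, and the sample lines are placeholders rather than real CellProfiler output.

```python
import shlex


def wrap_for_scheduler(batch_command_output, template="echo {command} | qsub"):
    """Turn one-job-per-line text into a shell script that submits each job.

    Hypothetical helper; the template is an assumption about your scheduler.
    """
    script_lines = ["#!/bin/sh"]
    for line in batch_command_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the captured output
        # Quote the whole job line so the shell passes it on unchanged.
        script_lines.append(template.format(command=shlex.quote(line)))
    return "\n".join(script_lines) + "\n"


# Placeholder text standing in for the captured output.
sample_output = "job one\n\njob two\n"
print(wrap_for_scheduler(sample_output))
```

Check the generated script by hand before running it, since the exact submission syntax and any resource options vary between clusters.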
The above notes assume that you are running CellProfiler using our source code (see
"Developer's Guide" under Help for more details). If you are using the compiled version,
you would replace ./python-2.6.sh CellProfiler.py with the CellProfiler
executable file itself and run it from the installation folder.
Once all the jobs are submitted, the cluster will run each batch individually
and output any measurements or images specified in the pipeline. Specifying the output filename
using the -o switch when calling CellProfiler will also produce an output file containing the measurements
for that batch of images in the output folder. Check the output from the batch
processes to make sure all batches complete. Batches that fail for transient reasons
can be resubmitted.
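A simple completeness check can be sketched as follows, assuming each job was given its own output filename with -o. The naming pattern `batch_{first}_{last}.h5` and the helper `missing_batches` are hypothetical; substitute whatever names you passed on the command line. A file that exists and is non-empty does not prove the batch succeeded, so treat this as a first pass only.

```python
import os
import tempfile


def missing_batches(ranges, output_dir, name_pattern="batch_{first}_{last}.h5"):
    """Return the (first, last) ranges whose output file is absent or empty.

    Hypothetical helper; name_pattern is an assumed naming scheme.
    """
    missing = []
    for first, last in ranges:
        path = os.path.join(output_dir,
                            name_pattern.format(first=first, last=last))
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append((first, last))
    return missing


# Demonstration in a temporary folder: only the first batch has an output file.
with tempfile.TemporaryDirectory() as output_dir:
    with open(os.path.join(output_dir, "batch_1_50.h5"), "wb") as handle:
        handle.write(b"placeholder")
    demo_missing = missing_batches([(1, 50), (51, 100)], output_dir)

print(demo_missing)  # [(51, 100)]
```

The returned ranges can be used to resubmit only the jobs that did not finish.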
For additional help on batch processing, refer to our
wiki page on installing CellProfiler on a Unix system or
our wiki page on adapting CellProfiler to a LIMS environment,
or post your questions on the CellProfiler CPCluster forum.